K-Means Clustering

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares.
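
To make the inertia criterion concrete, the sketch below recomputes it by hand and checks it against `KMeans.inertia_`. The data here (`X_demo`, 3 blobs, the chosen seeds) are illustrative assumptions, not the dataset used later in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative toy data (not the dataset used later in this notebook)
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Inertia: for every point, take the squared distance to its assigned
# cluster center, then sum over all points.
manual_inertia = sum(
    np.sum((X_demo[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
    for k in range(3)
)

print(manual_inertia, km.inertia_)   # the two values agree
```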

Some real-world applications of k-means:

  • Customer segmentation
  • Understanding what the visitors of a website are trying to accomplish
  • Pattern recognition
  • Machine learning
  • Data compression
In [1]:
import random 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs 
%matplotlib inline
In [2]:
np.random.seed(0)

Making random clusters of points using the make_blobs function. make_blobs can take many inputs, but we will be using these specific ones.

  • Input

    • n_samples: The total number of points equally divided among clusters.
      • Value will be: 5000
    • centers: The number of centers to generate, or the fixed center locations.
      • Value will be: $[[4, 4], [-2, -1], [2, -3],[1,1]]$
    • cluster_std: The standard deviation of the clusters.
      • Value will be: 0.9
  • Output

    • X: Array of shape [n_samples, n_features]. (Feature Matrix)
      • The generated samples.
    • y: Array of shape [n_samples]. (Response Vector)
      • The integer labels for cluster membership of each sample.
In [3]:
X, y = make_blobs(n_samples=5000, centers=[[4,4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)
In [4]:
plt.scatter(X[:, 0], X[:, 1], marker='.')
Out[4]:
<matplotlib.collections.PathCollection at 0x7fe0cd6068e0>
In [5]:
k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)
In [6]:
k_means.fit(X)
Out[6]:
KMeans(n_clusters=4, n_init=12)
In [7]:
k_means_labels = k_means.labels_
k_means_labels
Out[7]:
array([1, 2, 2, ..., 3, 1, 1], dtype=int32)
In [8]:
k_means_cluster_centers = k_means.cluster_centers_
k_means_cluster_centers
Out[8]:
array([[ 0.96959198,  0.98543802],
       [-2.03375169, -0.99827293],
       [ 1.99876902, -3.01796355],
       [ 3.97334234,  3.98758687]])
In [9]:
fig = plt.figure(figsize=(6, 4))

# Colors uses a color map, which will produce an array of colors based on
# the number of labels there are. We use set(k_means_labels) to get the
# unique labels.
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))

ax = fig.add_subplot(1, 1, 1)

# For loop that plots the data points and centroids.
# k will range from 0-3, which will match the possible clusters that each
# data point is in.
for k, col in zip(range(len(k_means_cluster_centers)), colors):

    # Create a list of all data points, where the data points that are
    # in the cluster (ex. cluster 0) are labeled as true, else they are
    # labeled as false.
    my_members = (k_means_labels == k)
    
    # Define the centroid, or cluster center.
    cluster_center = k_means_cluster_centers[k]
    
    # Plots the datapoints with color col.
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
    
    # Plots the centroids with specified color, but with a darker outline
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,  markeredgecolor='k', markersize=6)


ax.set_title('KMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.show()

Elbow method

The elbow method is a very popular technique. The idea is to run k-means clustering for a range of cluster counts k (say, from 1 to 10) and, for each value, calculate the sum of squared distances from each point to its assigned center (the distortion).

When the distortions are plotted, the curve often looks like an arm; the “elbow” (the point where the curve bends most sharply) is a good estimate of the best value of k.

In [10]:
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(X)
    distortions.append(kmeanModel.inertia_)
In [11]:
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
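
Reading the elbow off a plot is subjective. One simple heuristic (an assumption of this sketch, not part of the original analysis) is to pick the k with the largest second difference of the distortion curve, i.e. where the curve bends most sharply. The data and seeds here are illustrative; the centers mirror the ones used earlier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical blob data; the centers mirror the ones used earlier.
X_demo, _ = make_blobs(n_samples=1000, centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
                       cluster_std=0.9, random_state=0)

ks = range(1, 10)
distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
               for k in ks]

# The elbow is where the curve bends most sharply: the k whose second
# difference of the distortion curve is largest.
second_diff = np.diff(distortions, 2)          # length len(ks) - 2
best_k = ks[int(np.argmax(second_diff)) + 1]   # shift by 1 to recover k
print(best_k)
```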
In [12]:
# !pip install yellowbrick

Using yellowbrick

The KElbowVisualizer implements the “elbow” method to help data scientists select the optimal number of clusters by fitting the model with a range of values for 𝐾.

In [14]:
!pip install yellowbrick
from yellowbrick.cluster import KElbowVisualizer
Collecting yellowbrick
  Downloading yellowbrick-1.3.post1-py3-none-any.whl (271 kB)
     |████████████████████████████████| 271 kB 677 kB/s eta 0:00:01
Requirement already satisfied: scikit-learn>=0.20 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from yellowbrick) (0.23.1)
Requirement already satisfied: scipy>=1.0.0 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from yellowbrick) (1.4.1)
Requirement already satisfied: matplotlib!=3.0.0,>=2.0.2 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from yellowbrick) (3.2.2)
Requirement already satisfied: numpy<1.20,>=1.16.0 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from yellowbrick) (1.18.5)
Requirement already satisfied: cycler>=0.10.0 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from yellowbrick) (0.10.0)
Requirement already satisfied: joblib>=0.11 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from scikit-learn>=0.20->yellowbrick) (0.16.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from scikit-learn>=0.20->yellowbrick) (2.1.0)
Requirement already satisfied: python-dateutil>=2.1 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.2.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.4.7)
Requirement already satisfied: six in /home/asf/anaconda3/envs/DS4A_Gral/lib/python3.8/site-packages (from cycler>=0.10.0->yellowbrick) (1.15.0)
Installing collected packages: yellowbrick
Successfully installed yellowbrick-1.3.post1
In [15]:
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe0cc156610>

By default, the scoring parameter metric is set to distortion, which computes the sum of squared distances from each point to its assigned center. However, two other metrics can also be used with the KElbowVisualizer – silhouette and calinski_harabasz. The silhouette score calculates the mean Silhouette Coefficient of all samples, while the calinski_harabasz score computes the ratio of dispersion between and within clusters.
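
The same scores are also exposed directly in sklearn.metrics, so the visualizer's numbers can be cross-checked by hand. A minimal sketch on hypothetical blob data (the `X_demo` values and seeds are assumptions of this example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Hypothetical, well-separated blob data for illustration
X_demo, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# silhouette: mean silhouette coefficient, in [-1, 1]; higher is better.
# calinski_harabasz: between-/within-cluster dispersion ratio; higher is better.
sil = silhouette_score(X_demo, labels)
ch = calinski_harabasz_score(X_demo, labels)
print(round(sil, 3), round(ch, 1))
```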

In [16]:
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,12), metric='calinski_harabasz', timings=False)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe0cc052940>

Silhouette Method

The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).

The range of the Silhouette value is between +1 and -1. A high value is desirable and indicates that the point is placed in the correct cluster. If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.
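
For a single sample i the silhouette value is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the mean distance to the points of the nearest other cluster. A hand computation on two tiny hypothetical clusters, checked against sklearn's silhouette_samples:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tiny, well-separated hypothetical clusters
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# Silhouette of sample 0:
#   a = mean distance to the other points of its own cluster
#   b = mean distance to the points of the nearest other cluster
a = np.linalg.norm(X_demo[0] - X_demo[1])
b = np.mean([np.linalg.norm(X_demo[0] - X_demo[j]) for j in (2, 3)])
s0 = (b - a) / max(a, b)

print(s0, silhouette_samples(X_demo, labels)[0])   # the two values agree
```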

In [17]:
from sklearn.metrics import silhouette_score

sil = []
kmax = 10

# the silhouette is not defined for a single cluster, so the minimum number of clusters is 2
for k in range(2, kmax+1):
  kmeans = KMeans(n_clusters = k).fit(X)
  labels = kmeans.labels_
  sil.append(silhouette_score(X, labels, metric = 'euclidean'))
In [18]:
plt.figure(figsize=(16,8))
plt.plot(range(2, kmax+1), sil, 'bx-')
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.title('The Silhouette Method showing the optimal k')
plt.show()
In [19]:
from yellowbrick.cluster import SilhouetteVisualizer
In [20]:
model = KMeans(4, random_state=42)
visualizer = SilhouetteVisualizer(model, colors='yellowbrick')

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe0c77bb580>

Intercluster Distance Maps

Intercluster distance maps display an embedding of the cluster centers in 2 dimensions, with the distances to the other centers preserved: the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default, they are sized by membership, i.e. the number of instances that belong to each center, which gives a sense of the relative importance of clusters. Note, however, that if two clusters overlap in the 2D space, it does not imply that they overlap in the original feature space.
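
The distances the map tries to preserve are just the pairwise Euclidean distances between the fitted centers, which can be computed directly. A sketch on hypothetical blob data (the centers below mirror the ones used earlier in this notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical blob data; the centers mirror the ones used earlier.
X_demo, _ = make_blobs(n_samples=500, centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
                       cluster_std=0.9, random_state=0)
centers = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo).cluster_centers_

# Pairwise Euclidean distances between centers: the quantities the
# 2-D embedding tries to preserve.
pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
print(np.round(pairwise, 2))
```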

In [21]:
from yellowbrick.cluster import InterclusterDistance
In [22]:
# Instantiate the clustering model and visualizer
model = KMeans(4)
visualizer = InterclusterDistance(model)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe0c7706490>
In [23]:
from sklearn.cluster import MiniBatchKMeans
from yellowbrick.cluster import intercluster_distance

intercluster_distance(MiniBatchKMeans(4, random_state=0), X)
Out[23]:
InterclusterDistance(ax=<matplotlib.axes._subplots.AxesSubplot object at 0x7fe0c7691d00>,
                     estimator=MiniBatchKMeans(n_clusters=4, random_state=0))

Customer Segmentation with K-Means

In [24]:
import pandas as pd
cust_df = pd.read_csv("Cust_Segmentation.csv")
cust_df.head()
Out[24]:
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted Address DebtIncomeRatio
0 1 41 2 6 19 0.124 1.073 0.0 NBA001 6.3
1 2 47 1 26 100 4.582 8.218 0.0 NBA021 12.8
2 3 33 2 10 57 6.111 5.802 1.0 NBA013 20.9
3 4 29 2 4 19 0.681 0.516 0.0 NBA009 6.3
4 5 47 1 31 253 9.308 8.908 0.0 NBA008 7.2
In [25]:
df = cust_df.drop('Address', axis=1)
df.head()
Out[25]:
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio
0 1 41 2 6 19 0.124 1.073 0.0 6.3
1 2 47 1 26 100 4.582 8.218 0.0 12.8
2 3 33 2 10 57 6.111 5.802 1.0 20.9
3 4 29 2 4 19 0.681 0.516 0.0 6.3
4 5 47 1 31 253 9.308 8.908 0.0 7.2
In [26]:
from sklearn.preprocessing import StandardScaler
X = df.values[:,1:]
X
Out[26]:
array([[41.   ,  2.   ,  6.   , ...,  1.073,  0.   ,  6.3  ],
       [47.   ,  1.   , 26.   , ...,  8.218,  0.   , 12.8  ],
       [33.   ,  2.   , 10.   , ...,  5.802,  1.   , 20.9  ],
       ...,
       [25.   ,  4.   ,  0.   , ...,  3.21 ,  1.   , 33.4  ],
       [32.   ,  1.   , 12.   , ...,  0.696,  0.   ,  2.9  ],
       [52.   ,  1.   , 16.   , ...,  3.638,  0.   ,  8.6  ]])
In [27]:
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet
Out[27]:
array([[ 0.74291541,  0.31212243, -0.37878978, ..., -0.59048916,
        -0.52379654, -0.57652509],
       [ 1.48949049, -0.76634938,  2.5737211 , ...,  1.51296181,
        -0.52379654,  0.39138677],
       [-0.25251804,  0.31212243,  0.2117124 , ...,  0.80170393,
         1.90913822,  1.59755385],
       ...,
       [-1.24795149,  2.46906604, -1.26454304, ...,  0.03863257,
         1.90913822,  3.45892281],
       [-0.37694723, -0.76634938,  0.50696349, ..., -0.70147601,
        -0.52379654, -1.08281745],
       [ 2.1116364 , -0.76634938,  1.09746566, ...,  0.16463355,
        -0.52379654, -0.2340332 ]])
In [28]:
visualizer = KElbowVisualizer(model, k=(2,12), metric='calinski_harabasz', timings=False)

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe0c5d19850>
In [29]:
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,10))

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe0c5c9b8b0>
In [30]:
clusterNum = 4
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(X)
labels = k_means.labels_
print(labels)
[0 1 2 0 3 2 2 2 0 1 2 0 0 0 0 0 0 0 2 0 0 0 0 2 1 2 0 0 2 0 1 2 0 0 0 0 0
 0 0 1 0 1 0 1 0 2 0 0 0 2 2 0 0 2 2 0 0 0 2 0 2 0 2 2 0 0 2 0 0 0 2 2 1 0
 0 0 2 0 1 2 2 2 1 0 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 0 0 1 2 2 0 2 0 0 2 2 2
 0 0 0 0 0 0 2 2 0 0 0 0 2 0 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 0 0 0 0 2 0 2 0
 0 0 0 0 0 0 2 0 1 1 0 2 0 0 2 2 0 0 2 0 0 2 2 0 0 0 0 0 2 0 0 2 0 0 0 1 0
 0 0 0 0 2 0 0 2 0 2 0 0 2 3 0 1 0 0 0 0 2 0 3 1 0 0 2 2 2 0 0 2 2 2 2 0 1
 0 0 0 0 1 0 0 2 0 0 2 0 2 0 0 0 2 0 0 0 0 0 0 1 2 2 0 0 0 0 0 0 2 0 0 0 0
 0 0 2 2 2 2 0 0 2 0 0 0 0 0 2 0 0 0 2 0 0 0 1 1 0 1 0 2 0 1 2 0 0 0 0 0 0
 0 0 0 2 2 2 0 0 0 2 2 0 0 0 2 0 0 0 2 0 0 0 0 0 2 0 2 0 0 2 0 0 1 2 2 1 0
 0 2 0 0 2 0 2 0 2 0 0 2 0 0 2 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 2 0 2 1 1 2
 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0
 2 1 0 0 2 0 0 0 0 2 0 2 0 2 2 0 0 2 0 0 0 0 0 2 0 0 0 2 0 0 0 2 2 2 0 0 3
 2 0 2 2 2 0 2 0 0 0 1 0 0 0 0 2 0 1 0 0 0 0 2 0 2 2 2 0 2 1 2 0 2 0 0 0 2
 0 2 0 0 0 0 1 0 0 0 2 0 2 0 0 0 2 2 0 0 0 1 1 0 0 0 0 2 0 0 0 0 1 2 0 0 0
 0 2 2 0 0 0 0 0 0 2 0 0 0 0 3 2 0 0 0 0 0 0 1 0 0 0 2 2 2 0 1 0 2 3 0 1 0
 0 3 0 0 0 2 2 0 2 0 2 2 0 2 0 0 3 0 0 2 0 0 0 0 2 2 0 0 2 2 0 0 0 0 2 0 2
 2 0 0 0 0 0 2 0 0 2 0 2 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 2
 1 0 0 2 2 1 0 0 2 0 2 0 0 1 0 2 2 1 0 0 0 0 0 1 2 0 0 0 2 2 0 0 2 2 1 0 2
 2 0 0 0 2 2 3 0 0 1 0 2 0 0 0 0 0 1 0 2 0 2 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0
 0 0 2 0 0 2 0 1 0 2 1 0 0 0 1 2 2 0 0 0 0 2 1 0 0 0 0 2 2 0 0 1 2 0 2 0 2
 0 2 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 2 0 1 2 0 2 0 2 2 0 0 2 0 0 0 0 0 2 2
 0 0 0 0 0 0 0 2 0 2 0 0 0 2 3 2 2 0 0 0 0 0 0 0 1 0 0 0 0 2 2 2 0 0 0 0 2
 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2]
In [31]:
df["Clus_km"] = labels
df.head(5)
Out[31]:
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio Clus_km
0 1 41 2 6 19 0.124 1.073 0.0 6.3 0
1 2 47 1 26 100 4.582 8.218 0.0 12.8 1
2 3 33 2 10 57 6.111 5.802 1.0 20.9 2
3 4 29 2 4 19 0.681 0.516 0.0 6.3 0
4 5 47 1 31 253 9.308 8.908 0.0 7.2 3
In [32]:
df.groupby('Clus_km').mean()
Out[32]:
Customer Id Age Edu Years Employed Income Card Debt Other Debt Defaulted DebtIncomeRatio
Clus_km
0 433.456172 32.050089 1.611807 5.445438 28.048301 0.930440 1.896669 0.296774 10.122898
1 399.150000 43.416667 2.183333 19.483333 123.400000 3.836667 7.340467 0.108696 9.036667
2 411.262443 39.764706 1.805430 12.923077 62.814480 2.312855 4.445878 0.203297 10.691855
3 453.500000 46.600000 2.300000 21.200000 270.900000 7.884000 13.375200 0.428571 8.210000
In [33]:
import seaborn as sns
plt.figure(figsize=(6, 4))
sns.boxplot(y='Age',x='Clus_km',data=df)
plt.show()
In [34]:
plt.figure(figsize=(6, 4))
sns.boxplot(y='Income',x='Clus_km',data=df)
plt.show()
In [35]:
area = np.pi * ( X[:, 1])**2  
plt.scatter(X[:, 0], X[:, 3], s=area, c=labels.astype(float), alpha=0.5)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()
In [36]:
from mpl_toolkits.mplot3d import Axes3D 
fig = plt.figure(1, figsize=(8, 6))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

plt.cla()
# plt.ylabel('Age', fontsize=18)
# plt.xlabel('Income', fontsize=16)
# plt.zlabel('Education', fontsize=16)
ax.set_xlabel('Education')
ax.set_ylabel('Age')
ax.set_zlabel('Income')

ax.scatter(X[:, 1], X[:, 0], X[:, 3], c=labels.astype(float))
Out[36]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7fe0c5bdac70>

Density-Based Clustering

Density-based Clustering locates regions of high density that are separated from one another by regions of low density. Density, in this context, is defined as the number of points within a specified radius.
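
The two DBSCAN parameters that encode this definition are eps (the radius) and min_samples (the density threshold). A minimal sketch on hypothetical toy points, where the isolated point ends up labeled -1 (noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A dense clump of four points plus one isolated point (hypothetical values)
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

# A point is a core point if at least min_samples points (itself included)
# lie within radius eps; points not reachable from any core point get -1.
db = DBSCAN(eps=0.3, min_samples=3).fit(X_demo)
print(db.labels_)   # the isolated point is labeled -1 (noise)
```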

In [37]:
import numpy as np 
from sklearn.cluster import DBSCAN 
from sklearn.datasets import make_blobs 
from sklearn.preprocessing import StandardScaler 
import matplotlib.pyplot as plt 
%matplotlib inline

Data generation

  • centroidLocation: Coordinates of the centroids that will generate the random data.
    • Example: input: $[[4,3], [2,-1], [-1,4]]$
  • numSamples: The number of data points we want generated, split over the number of centroids (# of centroids defined in centroidLocation)
    • Example: 1500
  • clusterDeviation: The standard deviation of the clusters. The larger the number, the more spread out each cluster.
    • Example: 0.5
In [38]:
def createDataPoints(centroidLocation, numSamples, clusterDeviation):
    # Create random data and store in feature matrix X and response vector y.
    X, y = make_blobs(n_samples=numSamples, centers=centroidLocation, 
                                cluster_std=clusterDeviation)
    
    # Standardize features by removing the mean and scaling to unit variance
    X = StandardScaler().fit_transform(X)
    return X, y
In [39]:
X, y = createDataPoints([[4,3], [2,-1], [-1,4]] , 1500, 0.5)
In [40]:
plt.scatter(X[:, 0], X[:, 1], marker='.')
Out[40]:
<matplotlib.collections.PathCollection at 0x7fe0c593d5b0>
In [41]:
from sklearn.metrics import silhouette_samples, silhouette_score
range_eps = [0.1,0.2,0.3,0.4,0.5]

for i in range_eps:
    print(f"eps value is {i}")
    db = DBSCAN(eps=i, min_samples=7).fit(X)
    core_samples_mask = np.zeros_like(db.labels_,dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_
    print(set(labels))
    silhouette_avg = silhouette_score(X,labels)
    print(f"For eps value = {i} ", labels, f"The average silhouette_score is: {silhouette_avg}")
eps value is 0.1
{0, 1, 2, -1}
For eps value = 0.1  [0 1 2 ... 2 1 0] The average silhouette_score is: 0.7096699806064308
eps value is 0.2
{0, 1, 2, -1}
For eps value = 0.2  [0 1 2 ... 2 1 0] The average silhouette_score is: 0.7498217605897793
eps value is 0.3
{0, 1, 2}
For eps value = 0.3  [0 1 2 ... 2 1 0] The average silhouette_score is: 0.8129384026564485
eps value is 0.4
{0, 1, 2}
For eps value = 0.4  [0 1 2 ... 2 1 0] The average silhouette_score is: 0.8129384026564485
eps value is 0.5
{0, 1}
For eps value = 0.5  [0 1 0 ... 0 1 0] The average silhouette_score is: 0.624517592713078
In [42]:
epsilon = 0.4
minimumSamples = 7
db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(X)
labels = db.labels_
labels
Out[42]:
array([0, 1, 2, ..., 2, 1, 0])

Distinguish outliers

Let's build core_samples_mask: elements are set to True for points that are core points of a cluster and left False otherwise, so outliers stay False.

In [43]:
# First, create an array of booleans using the labels from db.
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
core_samples_mask
Out[43]:
array([ True,  True,  True, ...,  True,  True,  True])
In [44]:
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_clusters_
Out[44]:
3
In [45]:
# Remove repetition in labels by turning it into a set.
unique_labels = set(labels)
unique_labels
Out[45]:
{0, 1, 2}
In [46]:
# Create colors for the clusters.
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
In [47]:
# Plot the points with colors
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # Plot the datapoints that are clustered
    xy = X[class_member_mask & core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1],s=50, c=[col], marker=u'o', alpha=0.5)

    # Plot the outliers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1],s=50, c=[col], marker=u'x', alpha=0.5)

Weather Station Clustering using DBSCAN & scikit-learn

Dataset: Environment Canada Monthly Values for July - 2015

In [48]:
import csv
import pandas as pd
import numpy as np

filename='weather-stations20140101-20141231.csv'

#Read csv
pdf = pd.read_csv(filename)
pdf.head(5)
Out[48]:
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... DwP P%N S_G Pd BS DwBS BS% HDD CDD Stn_No
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... 0.0 NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 0.0 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... 9.0 NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055
3 DISCOVERY ISLAND 48.425 -123.226 BC NaN NaN NaN 12.5 0.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 1012475
4 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... 2.0 NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573

5 rows × 25 columns

In [49]:
pdf = pdf[pd.notnull(pdf["Tm"])]
pdf = pdf.reset_index(drop=True)
pdf.head(5)
Out[49]:
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... DwP P%N S_G Pd BS DwBS BS% HDD CDD Stn_No
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... 0.0 NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 0.0 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... 9.0 NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055
3 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... 2.0 NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573
4 ESQUIMALT HARBOUR 48.432 -123.439 BC 8.8 0.0 NaN 13.1 0.0 1.9 ... 8.0 NaN NaN 12.0 NaN NaN NaN 258.6 0.0 1012710

5 rows × 25 columns

In [50]:
import folium
import re

llon=-140
ulon=-50
llat=40
ulat=65

pdf = pdf[(pdf['Long'] > llon) & (pdf['Long'] < ulon) & (pdf['Lat'] > llat) &(pdf['Lat'] < ulat)]

m = folium.Map(location=[pdf.Lat.mean(), pdf.Long.mean()], zoom_start=9, 
               tiles='Stamen Toner')

for _, row in pdf.iterrows():
    folium.CircleMarker(
        location=[row.Lat, row.Long],
        radius=5,
        popup=re.sub(r'[^a-zA-Z ]+', '', row.Stn_Name),
        color='#1787FE',
        fill=True,
        fill_color='#1787FE'
    ).add_to(m)
In [51]:
m
Out[51]:
(interactive folium map of the weather stations)
In [52]:
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)

Clus_dataSet = pdf[['Long','Lat']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)

# Compute DBSCAN
db = DBSCAN(eps=0.15, min_samples=10).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels

realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels)) 


# A sample of clusters
pdf[["Stn_Name","Tx","Tm","Clus_Db"]].head(5)
Out[52]:
Stn_Name Tx Tm Clus_Db
0 CHEMAINUS 13.5 8.2 0
1 COWICHAN LAKE FORESTRY 15.0 7.0 0
2 LAKE COWICHAN 16.0 6.8 0
3 DUNCAN KELVIN CREEK 14.5 7.7 0
4 ESQUIMALT HARBOUR 13.1 8.8 0
In [53]:
set(labels)
Out[53]:
{-1, 0, 1, 2, 3, 4, 5}
In [54]:
cols = ['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4',
        '#46f0f0', '#f032e6', '#bcf60c', '#fabebe', '#008080', '#e6beff', 
        '#9a6324', '#fffac8', '#800000', '#aaffc3', '#808000', '#ffd8b1', 
        '#000075', '#808080']*10
In [55]:
def create_map(pdf, cluster_column):
    m = folium.Map(location=[pdf.Lat.mean(), pdf.Long.mean()], zoom_start=9, tiles='Stamen Toner')
    
    for _, row in pdf.iterrows():

        if row[cluster_column] == -1:
            cluster_colour = '#000000'
        else:
            cluster_colour = cols[row[cluster_column]]

        folium.CircleMarker(
            location= [row['Lat'], row['Long']],
            radius=5,
            popup= row[cluster_column],
            color=cluster_colour,
            fill=True,
            fill_color=cluster_colour
        ).add_to(m)
        
    return m

m = create_map(pdf, 'Clus_Db')

m.save('Cluster_DBSCAN.html')
In [56]:
pdf.head()
Out[56]:
Stn_Name Lat Long Prov Tm DwTm D Tx DwTx Tn ... P%N S_G Pd BS DwBS BS% HDD CDD Stn_No Clus_Db
0 CHEMAINUS 48.935 -123.742 BC 8.2 0.0 NaN 13.5 0.0 1.0 ... NaN 0.0 12.0 NaN NaN NaN 273.3 0.0 1011500 0
1 COWICHAN LAKE FORESTRY 48.824 -124.133 BC 7.0 0.0 3.0 15.0 0.0 -3.0 ... 104.0 0.0 12.0 NaN NaN NaN 307.0 0.0 1012040 0
2 LAKE COWICHAN 48.829 -124.052 BC 6.8 13.0 2.8 16.0 9.0 -2.5 ... NaN NaN 11.0 NaN NaN NaN 168.1 0.0 1012055 0
3 DUNCAN KELVIN CREEK 48.735 -123.728 BC 7.7 2.0 3.4 14.5 2.0 -1.0 ... NaN NaN 11.0 NaN NaN NaN 267.7 0.0 1012573 0
4 ESQUIMALT HARBOUR 48.432 -123.439 BC 8.8 0.0 NaN 13.1 0.0 1.9 ... NaN NaN 12.0 NaN NaN NaN 258.6 0.0 1012710 0

5 rows × 26 columns

In [57]:
m
Out[57]:
(interactive folium map colored by DBSCAN cluster)
In [58]:
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler

sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['Long','Lat','Tx','Tm','Tn']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels

realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels)) 


# A sample of clusters
pdf[["Stn_Name","Tx","Tm","Clus_Db"]].head(5)
Out[58]:
Stn_Name Tx Tm Clus_Db
0 CHEMAINUS 13.5 8.2 0
1 COWICHAN LAKE FORESTRY 15.0 7.0 0
2 LAKE COWICHAN 16.0 6.8 0
3 DUNCAN KELVIN CREEK 14.5 7.7 0
4 ESQUIMALT HARBOUR 13.1 8.8 0
In [59]:
set(labels)
Out[59]:
{-1, 0, 1, 2, 3, 4, 5, 6, 7, 8}
In [60]:
m = create_map(pdf, 'Clus_Db')

m.save('Cluster_DBSCAN_MV.html')
In [61]:
m
Out[61]:
(interactive folium map colored by the multivariate DBSCAN clusters)
In [ ]: